feat: separate Dockerfile for Hadoop #1186


Merged: 9 commits merged into main from feat/separate-hadoop-dockerfile on Jul 14, 2025
Conversation


@dervoeti dervoeti commented Jun 20, 2025

Description

Follow up for #1173

This PR looks like a lot, but it essentially does one thing:
We had a single Dockerfile for Hadoop. This PR separates the build of Hadoop itself from the image build (which also includes things like hdfs-utils). We already do the same for HBase, where we have one Dockerfile for the image and another one just to build the HBase JARs.

For Hadoop, this has two advantages:

  • Currently, our HDFS image includes /stackable/patched-libs, which contains the Hadoop libraries other products use as dependencies. These are not required in the HDFS image itself, but because other products depend on Hadoop and need to COPY these libraries from the image in order to use them as dependencies, they had to be part of the HDFS image. Now products can depend on hadoop/hadoop instead, i.e. just on the Java build itself.
  • Builds of products that depend on Hadoop (HBase, Druid, Spark, Hive) should be faster, because things like hdfs-utils no longer need to be built, just Hadoop itself.

In the future we could even think about not copying everything from hadoop/hadoop into hadoop, but just the components required to run HDFS.
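As a rough sketch of the split described above (the stage names, base images and paths here are illustrative assumptions, not the actual Stackable Dockerfiles), the separation amounts to moving the Java build into its own image that downstream images copy from:

```dockerfile
# Hypothetical hadoop/hadoop/Dockerfile: builds only the Hadoop Java artifacts.
FROM stackable/java-devel AS hadoop-builder
# ... compile Hadoop; patched libraries end up under /stackable/patched-libs ...

# Hypothetical hadoop/Dockerfile: the HDFS runtime image. It copies only the
# runtime artifacts and no longer needs to carry /stackable/patched-libs.
FROM stackable/java-base
COPY --from=hadoop-builder /stackable/hadoop /stackable/hadoop

# A downstream product build (e.g. HBase) can now depend on the build image
# alone instead of the full HDFS image:
#   FROM hadoop/hadoop AS hadoop
#   COPY --from=hadoop /stackable/patched-libs /stackable/patched-libs
```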

One other change:
Since #1173 we build all of Hadoop; before that PR we skipped YARN, MapReduce and the minicluster. These components should be part of the libraries so that other products can use them, but they should not be part of the HDFS image (they weren't before, but now they are). So we now simply remove them from the distributed JARs after the build.
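A post-build pruning step along these lines could do the removal; the directory layout and JAR name patterns below are assumptions for illustration, not the PR's actual code:

```shell
# Illustrative sketch: prune YARN, MapReduce and minicluster artifacts from a
# Hadoop distribution tree after the build. Paths/patterns are hypothetical.
set -eu

# Stand-in for the built Hadoop distribution (real builds would use the
# actual install prefix, e.g. somewhere under /stackable).
DIST="$(mktemp -d)/share/hadoop"
mkdir -p "$DIST/hdfs" "$DIST/yarn" "$DIST/mapreduce"
touch "$DIST/hdfs/hadoop-hdfs-3.4.1.jar" \
      "$DIST/hdfs/hadoop-minicluster-3.4.1.jar"

# The pruning itself: drop whole component directories, then any stray
# minicluster JARs that ended up elsewhere in the tree.
rm -rf "$DIST/yarn" "$DIST/mapreduce"
find "$DIST" -name 'hadoop-minicluster-*.jar' -delete
```

After this, only the HDFS artifacts remain in the distribution tree, which matches the intent of keeping YARN/MapReduce/minicluster in the published libraries but out of the HDFS image.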

I tested building Hadoop 3.3.6 and 3.4.1, as well as Spark 3.5.6, Hive 4.0.0 and Druid 33.0.0. I also ran the smoke tests for the built images. Everything succeeded.

Definition of Done Checklist

Note

Not all of these items are applicable to all PRs; the author should update this template to leave in only the boxes that are relevant.

Please make sure all these things are done and tick the boxes

  • Changes are OpenShift compatible
  • All added packages (via microdnf or otherwise) have a comment on why they are added
  • Things not downloaded from Red Hat repositories should be mirrored in the Stackable repository and downloaded from there
  • All packages should have (if available) signatures/hashes verified
  • Add an entry to the CHANGELOG.md file
  • Integration tests ran successfully
TIP: Running integration tests with a new product image

The image can be built and uploaded to the kind cluster with the following commands:

bake --product <product> --image-version <stackable-image-version>
kind load docker-image <image-tagged-with-the-major-version> --name=<name-of-your-test-cluster>

See the output of bake to retrieve the image tag for <image-tagged-with-the-major-version>.

@dervoeti dervoeti self-assigned this Jun 23, 2025
@dervoeti dervoeti moved this to Development: Waiting for Review in Stackable Engineering Jun 23, 2025
@maltesander maltesander self-requested a review July 1, 2025 13:14
@maltesander maltesander moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Jul 1, 2025
@dervoeti
Member Author

I see that #1201 and #1204 have been merged in the meantime, I'll revert most of the changes made in these PRs since they are obsolete once this PR is merged.

@dervoeti dervoeti requested a review from maltesander July 14, 2025 12:59

@maltesander maltesander left a comment


Thanks! LGTM!

@dervoeti dervoeti added this pull request to the merge queue Jul 14, 2025
Merged via the queue into main with commit ab69baf Jul 14, 2025
3 checks passed
@dervoeti dervoeti deleted the feat/separate-hadoop-dockerfile branch July 14, 2025 15:00
@dervoeti dervoeti moved this from Development: In Review to Development: Done in Stackable Engineering Jul 15, 2025